{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# 使用TensorBoard可视化训练过程\n",
    "\n",
    "TensorBoard是一个用于追踪、可视化、分析模型训练过程和训练结果的工具，它提供了多种可视化功能，可以与PyTorch、TensorFlow、Keras、Huggingface transformers、ModelScope等机器学习框架一起使用，帮助用户了解模型的训练过程和性能。\n",
    "\n",
    "PAI提供了TensorBoard服务，支持用户在PAI创建TensorBoard应用，用于查看训练作业输出的TensorBoard日志。\n",
    "\n",
    "本文档将以不同的机器学习框架为示例，展示如何在PAI使用TensorBoard追踪和可视化模型训练过程。\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "\n",
    "## 费用说明\n",
    "\n",
    "本示例将会使用以下云产品，并产生相应的费用账单：\n",
    "\n",
    "- PAI-DLC：运行训练任务，详细计费说明请参考[PAI-DLC计费说明](https://help.aliyun.com/zh/pai/product-overview/billing-of-dlc)\n",
    "- OSS：存储训练任务输出的模型、训练代码、TensorBoard日志等，详细计费说明请参考[OSS计费概述](https://help.aliyun.com/zh/oss/product-overview/billing-overview)\n",
    "\n",
    "\n",
    "> 通过参与云产品免费试用，使用**指定资源机型**提交训练作业或是部署推理服务，可以免费试用PAI产品，具体请参考[PAI免费试用](https://help.aliyun.com/zh/pai/product-overview/free-quota-for-new-users)。\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "\n",
    "\n",
    "## 安装和配置SDK\n",
    "\n",
    "我们需要首先安装PAI Python SDK以运行本示例。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "\n",
    "!python -m pip install --upgrade pai"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "\n",
    "SDK需要配置访问阿里云服务需要的AccessKey，以及当前使用的工作空间和OSS Bucket。在PAI SDK安装之后，通过在 **命令行终端** 中执行以下命令，按照引导配置密钥、工作空间等信息。\n",
    "\n",
    "\n",
    "```shell\n",
    "\n",
    "# 以下命令，请在 命令行终端 中执行.\n",
    "\n",
    "python -m pai.toolkit.config\n",
    "\n",
    "```\n",
    "\n",
    "我们可以通过以下代码验证配置是否已生效。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import pai\n",
    "from pai.session import get_default_session\n",
    "\n",
    "print(pai.__version__)\n",
    "\n",
    "sess = get_default_session()\n",
    "\n",
    "# 获取配置的工作空间信息\n",
    "assert sess.workspace_name is not None\n",
    "print(sess.workspace_name)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 提交训练任务"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "我们首先需要准备训练脚本，使用将PyTorch的TensorBoard utility记录TensorBoard日志。\n",
    "\n",
    "\n",
    "> PyTorch提供的TensorBoard utilities的使用可以见文档： [torch.utils.tensorboard 文档](https://pytorch.org/docs/stable/tensorboard.html)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "!mkdir -p src"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "镜像里需要先安装TensorBoard，可以在训练目录中准备 ``requirements.txt`` 指定需要按照的第三方库。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%%writefile src/requirements.txt\n",
    "\n",
    "tensorboard"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%%writefile src/run.py\n",
    "\n",
    "import os\n",
    "\n",
    "import torch\n",
    "from torch.utils.tensorboard import SummaryWriter\n",
    "\n",
    "\n",
    "# 通过环境变量获取TensorBoard输出路径，默认为 /ml/output/tensorboard/\n",
    "tb_log_dir = os.environ.get(\"PAI_OUTPUT_TENSORBOARD\")\n",
    "print(f\"TensorBoard log dir: {tb_log_dir}\")\n",
    "writer = SummaryWriter(log_dir=tb_log_dir)\n",
    "\n",
    "def train_model(iter):\n",
    "\n",
    "\n",
    "    x = torch.arange(-5, 5, 0.1).view(-1, 1)\n",
    "    y = -5 * x + 0.1 * torch.randn(x.size())\n",
    "\n",
    "    model = torch.nn.Linear(1, 1)\n",
    "    criterion = torch.nn.MSELoss()\n",
    "    optimizer = torch.optim.SGD(model.parameters(), lr = 0.1)\n",
    "\n",
    "    for epoch in range(iter):\n",
    "        y1 = model(x)\n",
    "        loss = criterion(y1, y)\n",
    "        writer.add_scalar(\"Loss/train\", loss, epoch)\n",
    "        optimizer.zero_grad()\n",
    "        loss.backward()\n",
    "        optimizer.step()\n",
    "\n",
    "if __name__ == \"__main__\":\n",
    "    train_model(100)\n",
    "    writer.flush()\n",
    "\n",
    "\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from pai.estimator import Estimator\n",
    "from pai.image import retrieve\n",
    "\n",
    "\n",
    "est = Estimator(\n",
    "    command=\"python run.py\",\n",
    "    source_dir=\"./src\",\n",
    "    image_uri=retrieve(\"PyTorch\", \"latest\").image_uri,\n",
    "    instance_type=\"ecs.c6.large\",\n",
    ")\n",
    "\n",
    "est.fit(wait=False)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 使用TensorBoard应用监控训练"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "在PAI启动一个TensorBoard应用，查看使用Estimator的训练作业写出的TensorBoard日志。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "tb = est.tensorboard()\n",
    "\n",
    "print(tb.app_uri)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "使用完成之后，删除TensorBoard应用"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "tb.delete()"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "base",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.8.3"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
